In a recent publication, we discussed the benefits and 
cautions of using secondary data analyses in research on 
lifestyle and health behavior [1]. We provided some guidelines 
on the use of secondary data, covering both the contributions 
that can be made and the considerations that are necessary 
when using data collected by someone else. 
Using secondary data to explore social and health issues 
makes it possible to provide information about important 
issues in a timely fashion. Secondary data can answer two 
types of questions: descriptive and analytical [2]. Hence, the 
information can be used to describe events or trends or it 
can be used to examine relationships among variables cross- 
sectionally or longitudinally. 


What are secondary data? Secondary data are information 
that was collected by someone other than the current user 
and/or for purposes other than the current one. 
Common sources of secondary data can include 
census information, information collected by government 
departments on health and other behaviors, organization 
records such as hospital, housing, and arrest data, and 
information that was originally collected for research purposes 
such as public health surveys [2]. Data that can be used include 
both large-scale surveys and data collected using qualitative 
methods. Information can be collected on a regular basis from 
different samples [repeated cross-sectional] or from the same 
individuals over time [longitudinal]. 


When considering the use of secondary data, it is crucial that 
the original parameters of data collection, question wording, 
and sampling are clear and that any sources of bias are noted 
[3]. Understanding the sources of the information, how it was 
collected, and how it was analyzed is a precursor to the 
appropriate future use of the information [3,4]. 


Initially, we noted several benefits. First, a lot of information 
is available from existing sources and can be used to 
make important content contributions to knowledge, as well 
as to provide the backdrop for future research. Second, since 
the information is already available, research can be conducted 
promptly, without the timelines needed for submitting funding 
proposals. Ethics approval for the original data collection 
would already be in place, although new approval may be 
required if the information is being used for a purpose not 
originally proposed or contains identifiers or sensitive 
information such as that found in clinical records [2]. 
Continued use of secondary data also gives the researchers who 
collected the original data information that they can use to 
justify the continuation of their research. In 
other words, the information gathered and used in secondary 
analyses often raises questions and can provide the backdrop 
for additional research. 


The original discussion was followed by the presentation of 
cautions. First, as noted, the available data may not provide all 
of the information of interest. This situation can lead to additional 
research, but it may leave some questions and explanations 
open. Second, response rates to surveys have been decreasing 
over time. In a detailed analysis of surveys conducted by Health 
and Human Services in the US, the decline from 1995-2015 
ranged from 10% to over 20% [5]. Not only were there declines 
in response rates, but there were declines in completion rates 
and increases in item non-response. These patterns leave open 
the issue of exactly how representative the responses are. In 
addition, questions may not be worded as precisely as we would 
like for the specific areas currently under consideration. It is 
nevertheless possible to learn from existing information and to 
use that information to provide the basis for new research and 
policy development. 


Additional considerations 


The purpose of this article is to outline some additional 
benefits and cautions that have come to light since our original 
publication and expand on some of the positive and negative 
issues associated with the use of secondary data. 


Comparisons with other uses: First, the data have already been 
used, at least to some extent, so information on the analysis and 
some findings is available for comparison and as guidance 
for potential future analyses. The definitions used in 
the original research can help guide future 
analyses; for example, cessation of a tobacco product can be 
defined as no use during a specific period such as one year. It 
is recommended that additional analyses include attempts to 
understand and verify the original analyses so that future analyses 
are consistent with earlier findings. 


Analytic issues: Surveys that are designed to be 
representative of a population often require complex statistical 
methodology and statistical programs that incorporate sample 
weights, stratification, and clustering. Users of these types 
of secondary survey data should ensure that they use the 
required sampling weights and design-based data analyses 
associated with the survey when attempting to replicate the 
original findings and when planning additional analyses. To do 
otherwise could lead to inappropriate estimates and standard 
errors of these estimates. For example, Kim, et al. [6] examined 
the percentage of research papers that employed appropriate 
statistical methodology while analyzing information from a 
secondary dataset, the Korean National Health and Nutrition 
Examination Survey [KNHANES]. Over 80% of published 
studies using the KNHANES used statistical analyses that 
were appropriate for datasets generated by a simple random 
sampling method but not analyses requiring incorporation 
of the complex survey design used in the collection of the 
information. The consequences of the inappropriate analyses 
were that the means and standard errors of the ordinary 
statistical analyses and the design-based analyses differed; the 
standard errors from the design-based analyses tended to be 
larger [6]. 
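
To illustrate why the survey design matters, the following minimal Python sketch compares a naive mean and standard error, computed as if the data came from simple random sampling, with a weighted mean and a Taylor-linearized standard error that uses strata and primary sampling units [PSUs]. It is not the procedure used in [6]. It assumes that PSUs are treated as sampled with replacement within strata, with no finite population correction, and the function names and toy data are hypothetical. In practice, the variance method documented for the specific survey should be followed.

```python
import numpy as np


def naive_mean_se(y):
    """Mean and standard error as if y came from simple random sampling."""
    y = np.asarray(y, dtype=float)
    return y.mean(), y.std(ddof=1) / np.sqrt(y.size)


def design_mean_se(y, weight, stratum, psu):
    """Weighted mean with a Taylor-linearized, design-based standard error.

    Assumption: PSUs are treated as sampled with replacement within each
    stratum and no finite population correction is applied.
    """
    y = np.asarray(y, dtype=float)
    weight = np.asarray(weight, dtype=float)
    stratum = np.asarray(stratum)
    psu = np.asarray(psu)

    total_weight = weight.sum()
    estimate = np.sum(weight * y) / total_weight

    # Linearized contribution of each respondent to the weighted mean.
    z = weight * (y - estimate) / total_weight

    variance = 0.0
    for h in np.unique(stratum):
        in_h = stratum == h
        # Sum the contributions within each PSU of stratum h.
        psu_totals = np.array(
            [z[in_h & (psu == j)].sum() for j in np.unique(psu[in_h])]
        )
        n_h = psu_totals.size
        if n_h < 2:
            raise ValueError("each stratum needs at least two PSUs")
        variance += n_h / (n_h - 1) * np.sum((psu_totals - psu_totals.mean()) ** 2)
    return estimate, np.sqrt(variance)


# Hypothetical toy data: 2 strata, 2 PSUs per stratum, 3 respondents per PSU.
y = [1, 1, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0]
weight = [10, 10, 10, 30, 30, 30, 20, 20, 20, 40, 40, 40]
stratum = [1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2]
psu = [1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4]

print("naive:  mean=%.3f se=%.3f" % naive_mean_se(y))
print("design: mean=%.3f se=%.3f" % design_mean_se(y, weight, stratum, psu))
```

When every respondent has the same weight and forms their own PSU within a single stratum, the two functions give the same result, which is a useful check that the design-based version reduces to the ordinary one.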


The comparison with earlier analyses does not mean 
that the potential use of information has to be limited to the 
original purposes of data collection. For example, prevalence 
estimation of patterns of drug use and other risk behaviors 
was among the original purposes of surveys conducted by 
the Centre for Addiction and Mental Health [CAMH] [7,8]. 
Subsequently, the relationships between sociodemographic 
variables and patterns of tobacco use among elementary and 
high school students were examined and provided additional 
information from the existing survey [9]. 


Particularly in secondary data analysis, one should avoid 
data dredging, that is, performing many statistical procedures 
and reporting only on those which are statistically significant. 
One should construct hypotheses from theoretical arguments, 
or as suggested by other studies, and then perform analyses on 
the secondary data. Even in this case, when working at a 
significance level of 0.05, about 5% of the true null hypotheses 
tested will be falsely found to be statistically significant. 
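
The arithmetic behind this caution can be sketched in a few lines of Python. The sketch assumes that all tested null hypotheses are true and, for the familywise error rate, that the tests are independent; the function names are hypothetical. The Bonferroni threshold is shown only as one simple, conservative adjustment and is not a recommendation from the studies cited here.

```python
def expected_false_positives(n_tests, alpha=0.05):
    """Expected number of falsely significant results when all nulls are true."""
    return n_tests * alpha


def familywise_error_rate(n_tests, alpha=0.05):
    """Chance of at least one false positive, assuming independent tests."""
    return 1 - (1 - alpha) ** n_tests


def bonferroni_threshold(n_tests, alpha=0.05):
    """Per-test significance level that keeps the familywise rate at most alpha."""
    return alpha / n_tests


n_tests = 20
print(expected_false_positives(n_tests))         # about 1 false positive expected
print(round(familywise_error_rate(n_tests), 3))  # roughly 0.64
print(bonferroni_threshold(n_tests))             # 0.0025 per test
```

With 20 tests at the 0.05 level, at least one falsely significant result is more likely than not, which is why reporting only the significant findings is misleading.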


If the secondary analyses do not meet 
the standards for the required complex design and weightings 
assigned to the original data, this needs to be acknowledged. It 
is important that additional analyses do not call into question 
either the findings from the original analyses or those 
from additional analyses, and that the objectives are specified. 
However, if the complex design information can be used in 
the analysis, the estimates of variance are more accurate. For 
example, the relationship between vaping and 
cigarette smoking in high school students from the CAMH 
surveys has been examined to learn about possible risk and 
protective factors, with the complex design incorporated in the 
subgroup analyses [9]. 


Combining datasets: Researchers are often interested in 
combining datasets from different years in order to 1) increase 
the sample size to get more precise estimates or 2) compute 
averages or differences in estimates from different years. One 
method is to stack the datasets for different years on top of 
each other and perform a “pooled” statistical analysis, using 
appropriate weightings. Some datasets come in an appropriate 
format, e.g., the National Survey on Drug Use and Health 
[10], but any statistical computer package can be used to 
concatenate datasets from different years. Lee, et al. [11] used a 
stacked dataset to estimate averages over years and differences 
between years, whereas Pederson, et al. [12] looked at changes 
in outcomes over years using a regression model. An alternative 
approach, used by Thomas and Wannell [13], is to compute 
estimates and their variances for each year and then calculate 
means [and their variances] over two or more years. Pederson, 
et al. [9] used the same methodology to compute differences 
in means and percentages between two years. In the analysis 
of pooled data, it is possible to use propensity score matching 
to examine the change in relationships over different years. 
However, there are challenges with this methodology; see, for 
example, Norris [14]. 
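
The per-year approach can be sketched as follows. This is a minimal illustration, not a reproduction of the methods in [9] or [13]: it assumes that the samples from different years are independent, so that the variances of the yearly estimates simply add, and that each yearly estimate and its variance were already computed with the appropriate design-based method. The function names and numbers are hypothetical.

```python
import math


def mean_over_years(estimates, variances):
    """Average of yearly estimates and its standard error.

    Assumption: yearly samples are independent, so
    Var(mean) = sum of yearly variances / k**2.
    """
    k = len(estimates)
    if k == 0 or k != len(variances):
        raise ValueError("need one variance per yearly estimate")
    mean = sum(estimates) / k
    return mean, math.sqrt(sum(variances) / k**2)


def difference_between_years(est_a, var_a, est_b, var_b):
    """Difference (year b minus year a) and its standard error.

    Assumption: the two yearly samples are independent, so
    Var(difference) = var_a + var_b.
    """
    return est_b - est_a, math.sqrt(var_a + var_b)


# Hypothetical prevalence estimates (proportions) and design-based variances.
mean, mean_se = mean_over_years([0.10, 0.20], [0.0004, 0.0004])
diff, diff_se = difference_between_years(0.10, 0.0004, 0.20, 0.0004)
print("mean=%.3f se=%.4f" % (mean, mean_se))
print("diff=%.3f se=%.4f z=%.2f" % (diff, diff_se, diff / diff_se))
```

If the same individuals or the same sampling units appear in more than one year, the independence assumption does not hold and a covariance term would be needed.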


Data quality: Some databases may be well developed 
methodologically [e.g., Statistics Canada [Statcan], with its 
questionnaire development, sampling, etc.] and include 
checks on data quality, missing values, and misinformation. 
As an example, the Better Outcomes Registry & Network 
[BORN], Ontario’s birth registry on pregnancy, birth, and 
early childhood, was independently evaluated for the accuracy of 
selected data elements entered into the database [15]. Data 
are collected by different methods at different points 
during pregnancy, birth, and childhood, and the same data 
elements may be collected more than once during the course of care. 
These data form a unified maternal-newborn record based on 
robust linking and matching algorithms. Accuracy assessment 
of the database indicated agreement from 56.9 to 99.8%, with 
76% of the data elements with greater than 90% agreement 
[5]. Hence, users of secondary databases should be aware of 
and assess what procedures were used to improve quality 
of original data collection and to eliminate inconsistent or 
incomplete information. 
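
As a simple illustration of this kind of check, the sketch below computes percent agreement between values stored in a database and values re-abstracted from a reference source, and then the share of data elements whose agreement exceeds 90%. It is not the procedure used in the BORN evaluation; the function names, data element names, and values are all hypothetical.

```python
def percent_agreement(database_values, reference_values):
    """Percentage of records where the database matches the reference source."""
    if len(database_values) != len(reference_values):
        raise ValueError("both sources must cover the same records")
    if not database_values:
        raise ValueError("no records to compare")
    matches = sum(d == r for d, r in zip(database_values, reference_values))
    return 100.0 * matches / len(database_values)


def share_with_high_agreement(agreement_by_element, threshold=90.0):
    """Percentage of data elements whose agreement is above the threshold."""
    high = [pct for pct in agreement_by_element.values() if pct > threshold]
    return 100.0 * len(high) / len(agreement_by_element)


# Hypothetical data: four records, two data elements.
registry = {
    "birth_weight_g": [3200, 2950, 4100, 3600],
    "smoking_status": ["no", "yes", "no", "no"],
}
chart = {
    "birth_weight_g": [3200, 2950, 4100, 3600],
    "smoking_status": ["no", "no", "no", "no"],
}

agreement = {
    name: percent_agreement(registry[name], chart[name]) for name in registry
}
print(agreement)
print(share_with_high_agreement(agreement))
```

Percent agreement does not correct for agreement expected by chance, so it is best read alongside the documentation of how the reference values were obtained.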


Things to watch for: When dealing with information from 
longitudinal or repeated cross-sectional surveys, there are 
some additional issues that need to be considered. For example, 
question wording can change, as can question sequencing 
and skip patterns. As a result, attempts to examine 
trends in substance use over time may need to take into 
consideration variations in question sequencing, wording, and 
response categories [16,17]. For example, evidence indicates 
that survey respondents have provided different frequency 
estimates for behaviors depending on the range of the response 
categories that were offered in a closed-ended question 
[18]. Additionally, the social and political environments can 
change; for example, cigarette smoking was acceptable forty 
years ago and now it is not. Therefore, smokers may now be 
more likely to misrepresent their behavior in light of social 
desirability concerns [18]. In these cases, it is particularly 
difficult to decide how information from a range of years can 
be considered together. As a result, it is critical to understand 
the environment and conditions that existed when information 
was originally collected. 


Data collection methods may differ. Information has 
been collected online, by telephone, or in person. For 
example, information on a health record can be provided by 
the individual patient at one point in time and collected by the 
provider at another. It is not clear what the impact of the use 
of different data collection methods might be. In the last few years, 
many omnibus surveys and polls have changed from random 
digit dialing telephone surveys [Computer Assisted Telephone 
Interviewing, CATI] to probability internet panel [PIP] surveys 
because of time and cost and the predominance of robocalls 
and the blocking of unknown telephone numbers. Surveys by 
Hemsworth, et al. [19] documented what can result from the 
use of different data collection procedures. They collected 
information from a CATI survey [CATI, n = 502] and a PIP survey 
[PANEL, n = 530] to examine differences regarding attitudes 
and behavior toward livestock use and welfare. There was little 
difference in demographics between the two surveys apart from 
highest level of education. However, there were differences in 
both attitudes and behavior toward the red meat industry after 
controlling for education. 


Conclusion 


There can be important contributions to knowledge as well 
as directions for future research and programs that emerge 
from secondary analyses. It is crucial to be aware of differences 
in methods and to acknowledge them and the possible impact 
they may have on responding. These factors do not preclude 
the use of secondary data, but need to be acknowledged when 
attempting to present findings from secondary data. It may be 
the case that the methodological and environmental differences 
can account for what appear to be changes or trends and these 
potential impacts should be noted. It is essential to be aware of 
the many environmental factors that can impact response rate 
and response content. It is also necessary to keep in mind that 
finding relationships and trends over time does not provide 
evidence of causality but only descriptive information about 
the relationship between variables. Moreover, it is important 
to use appropriate statistical methods with secondary datasets 
with complex sampling designs and weightings. 